Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
File descriptions
train.csv - the training set
test.csv - the test set
data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
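The benchmark behind sample_submission.csv can be approximated in a few lines of R. This is only a sketch under stated assumptions: the helper name benchmark_submission is hypothetical, the column names (Id, YrSold, MoSold, LotArea, BedroomAbvGr, SalePrice) are assumed to match the header of train.csv, and the small data frames are synthetic stand-ins so the example runs without the competition files.

```r
# Hypothetical helper: fit the benchmark linear regression and build a
# submission data frame with the two columns Id and SalePrice.
benchmark_submission <- function(train, test) {
  fit <- lm(SalePrice ~ YrSold + MoSold + LotArea + BedroomAbvGr, data = train)
  data.frame(Id = test$Id,
             SalePrice = as.numeric(predict(fit, newdata = test)))
}

# Synthetic stand-in data (NOT the Ames data), only to make the sketch runnable.
set.seed(1)
n <- 50
train <- data.frame(
  Id = 1:n,
  YrSold = sample(2006:2010, n, replace = TRUE),
  MoSold = sample(1:12, n, replace = TRUE),
  LotArea = round(runif(n, 5000, 15000)),
  BedroomAbvGr = sample(1:5, n, replace = TRUE)
)
train$SalePrice <- 50000 + 10 * train$LotArea + 8000 * train$BedroomAbvGr +
  rnorm(n, sd = 5000)

# Pretend the first five rows are unseen homes with new Ids and no price.
test <- train[1:5, c("Id", "YrSold", "MoSold", "LotArea", "BedroomAbvGr")]
test$Id <- (n + 1):(n + 5)

submission <- benchmark_submission(train, test)
print(submission)
# With the real files, the result could be saved with:
# write.csv(submission, "submission.csv", row.names = FALSE)
```

With the real data, train and test would instead come from read.csv("train.csv") and read.csv("test.csv").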
Data fields
Here’s a brief version of what you’ll find in the data description file.
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Number of bedrooms above basement level
KitchenAbvGr: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: Dollar value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Predictive Models    Adjusted R2    CV PRESS    Kaggle Score
Forward              .89            1272        .721
Backward             .78            1590        .945
Stepwise             .81            2001        .888
CUSTOM               .87            900         .2345
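For reference, the three metrics in the table are commonly defined as below. The notation is an assumption, not part of the handout: n observations, p predictors (excluding the intercept), e_i the ordinary residual, h_ii the leverage of observation i, and ŷ_(i) the prediction for observation i from a model fit without it. PRESS is shown in its leave-one-out form; under k-fold cross-validation the same squared prediction errors are summed over the held-out folds instead. The last line is the evaluation metric stated on the Kaggle competition page (root mean squared error on the log of the sale price). Higher is better for adjusted R2; lower is better for the other two.

```latex
\begin{align*}
R^2_{\text{adj}} &= 1 - \frac{\mathrm{SSE}/(n - p - 1)}{\mathrm{SST}/(n - 1)} \\
\mathrm{PRESS} &= \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2
               = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2 \\
\text{Kaggle score} &= \sqrt{ \frac{1}{n} \sum_{i=1}^{n}
  \left( \log \hat{y}_i - \log y_i \right)^2 }
\end{align*}
```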
NOTE 1: ALL ANALYSIS MUST BE DONE IN SAS and all code must be placed in the appendix. Part of the grading process will be to run the code and verify the Kaggle score for each group.
Note 2: An extra 2 points on the final exam will be awarded to the team with the model with the lowest (best) Kaggle Score. In the unlikely event of a tie we will split these points.
Deliverables:
Your group is to turn in a paper that is no more than 7 pages long (without the appendix). Please put your code in the appendix, but put any graphs and tables in the body of the paper.
Sample Format
Required deliverables in the complete report. The format of your paper (headers, sections, etc.) is flexible, although it should contain the following information:
Introduction
Data Description
(Where did the data come from? How big is it? How many observations? Where can we find out more? Which specific variables do we need to understand for your analysis?)
Analysis Question 1:
Restatement of Problem
Specify the Model
Checking Assumptions
Residual Plots
Influential point analysis (Cook’s D and Leverage)
Make sure to address each assumption.
Comparing Competing Models
adj R2
Interval CVPress
Parameter Interpretation
Interpretation
Confidence Intervals
Conclusion
A short summary of the analysis.
Analysis Question 2
Restatement of Problem
Model Selection
Type of Selection
Stepwise
Forward
Backward
CUSTOM
Checking Assumptions
Residual Plots
Influential point analysis (Cook’s D and Leverage)
Make sure to address each assumption.
Comparing Competing Models
Adj R2
Interval CVPress
Kaggle Score
Conclusion: A short summary of the analysis.
Appendix
Well-commented SAS Code for Analysis 1 and 2
Rubric:
Presentation (30%):
Organized paper with title, headings, subheadings, etc.
Labeled plots, figures, tables and charts.
Every plot, figure, table and chart included is referenced in the paper and vice versa.
No spelling or grammatical errors.
Analysis Question 1: (35%)
Analysis Question 2: (35%)
# Start from a clean workspace.
rm(list = ls())
# Read the Kaggle training set from the project's data directory.
home_dir <- "~/_smu/_src/home_prices/"
setwd(home_dir)
homes <- read.csv(file.path("data", "train.csv"), stringsAsFactors = FALSE)
names(homes) <- tolower(names(homes))
# Derived columns: sale date (first day of the sale month) and total bath count.
homes$sale_date <- as.Date(sprintf("%d-%02d-01", homes$yrsold, homes$mosold))
homes$total_baths <- homes$fullbath + homes$halfbath
# Uncomment pdf() and dev.off() to write the plots to a file.
#pdf("homes_train_plots.pdf", width = 10, height = 7)
par(mfrow = c(2, 3))
# Skip column 1 (id). For each integer column, draw an index plot colored by
# total_baths (the last column), a boxplot of the standardized values, and a
# scatterplot of sale price against the column.
for (i in 2:length(homes))
{
  # is.integer() returns a single TRUE/FALSE even for columns whose class()
  # has several elements, unlike class(x) == "integer".
  if (is.integer(homes[, i]))
  {
    plot(homes[, i],
         col = homes$total_baths + 1,
         main = names(homes)[i])
    boxplot(scale(homes[, i]))
    plot(homes$saleprice ~ homes[, i])
  }
}
#dev.off()
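The model checks named in the outline (adjusted R2, PRESS, residual plots, Cook's D and leverage) can also be explored in R before the graded analysis is written in SAS. The following is a minimal sketch, not the required method: the data frame demo is synthetic (it is not the Ames data), the two predictors are picked only for illustration, and the cutoffs 2p/n and 4/n are common rules of thumb rather than anything the assignment specifies.

```r
# Synthetic stand-in data (NOT the Ames data) so the sketch runs on its own.
set.seed(42)
n <- 100
demo <- data.frame(grlivarea = round(runif(n, 600, 3000)),
                   overallqual = sample(1:10, n, replace = TRUE))
demo$saleprice <- exp(10.5 + 0.0004 * demo$grlivarea +
                      0.1 * demo$overallqual + rnorm(n, sd = 0.15))

# Model the log of price; the Kaggle metric is also computed on the log scale.
fit <- lm(log(saleprice) ~ grlivarea + overallqual, data = demo)

adj_r2   <- summary(fit)$adj.r.squared
leverage <- hatvalues(fit)        # diagonal of the hat matrix, h_ii
cooks_d  <- cooks.distance(fit)   # Cook's D for each observation

# PRESS: sum of squared leave-one-out residuals, e_i / (1 - h_ii).
press <- sum((residuals(fit) / (1 - leverage))^2)

# Rule-of-thumb flags (assumptions): leverage above 2p/n or Cook's D above 4/n.
p <- length(coef(fit))
flagged <- which(leverage > 2 * p / n | cooks_d > 4 / n)

cat("Adjusted R2:", round(adj_r2, 3), " PRESS:", round(press, 3), "\n")
cat("Observations flagged:", length(flagged), "\n")

# Residual diagnostics: residuals vs fitted, normal Q-Q, scale-location, and
# residuals vs leverage with Cook's distance contours.
par(mfrow = c(2, 2))
plot(fit)
```

Replacing demo with the homes data frame and the formula with the candidate model gives the same diagnostics for the real training set.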